# install.packages(c("ggplot2", "dplyr", "tidyr", "tidytext"))
library(ggplot2)
library(dplyr)
library(tidyr)
library(tidytext) # for unnest_tokens() and the stop_words data set, used later
The purpose of this notebook is to demonstrate some of what is possible for the visualisation of a text. Quantitative analysis is a tool that can help to answer some questions, but it is not always useful and there are many questions it cannot address. I hope the demonstrations below are more inspiring than intimidating.
First, there must be a corpus or digitized text that can be analysed computationally. For this demonstration, I’ve used a corpus of Shakespeare’s plays and adapted some code from a Kaggle notebook.
# R must be at least 3.3.1 for `tm` and `slam` to work.
# install.packages("tm")
# install.packages("SnowballC")
library(tm)
Before we get into anything fun, we have to see what the corpus looks like; that is, how the data frame is structured. These are the first six lines of the corpus. NB: there is currently a small bug in the software that prevents the data from being shown neatly. It should be fixed soon.
shak <- read.csv("../data/Shakespeare_data.csv", header = TRUE, as.is = TRUE)
#shak<-na.omit(shak)
head(shak)
Dataline Play PlayerLinenumber ActSceneLine Player
1 1 Henry IV NA
2 2 Henry IV NA
3 3 Henry IV NA
4 4 Henry IV 1 1.1.1 KING HENRY IV
5 5 Henry IV 1 1.1.2 KING HENRY IV
6 6 Henry IV 1 1.1.3 KING HENRY IV
PlayerLine
1 ACT I
2 SCENE I. London. The palace.
3 Enter KING HENRY, LORD JOHN OF LANCASTER, the EARL of WESTMORELAND, SIR WALTER BLUNT, and others
4 So shaken as we are, so wan with care,
5 Find we a time for frighted peace to pant,
6 And breathe short-winded accents of new broils
Each column is labeled and the content of the column is consistent for each row (all 111396 of them!). Some of the rows may not be useful. Some contain empty cells (labeled NA). Some contain a lot of information and we might need to do some processing on them before we can use the information quantitatively.
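For example, rows without an ActSceneLine value are stage directions and act/scene headers rather than spoken lines. A minimal sketch of filtering them out, using a toy data frame with the same column names (this notebook later uses the same test, `ActSceneLine != ""`, so empty strings are assumed here; in other loads the gaps may come through as NA instead):

```r
# Toy data frame mimicking the structure of the corpus above
toy <- data.frame(
  Play         = c("Henry IV", "Henry IV", "Henry IV"),
  ActSceneLine = c("", "", "1.1.1"),
  Player       = c("", "", "KING HENRY IV"),
  PlayerLine   = c("ACT I", "SCENE I. London. The palace.",
                   "So shaken as we are, so wan with care,"),
  stringsAsFactors = FALSE
)
# Keep only rows that carry a spoken line
dialogue <- toy[toy$ActSceneLine != "", ]
nrow(dialogue)  # 1 -- only the spoken line remains
```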
The first thing we’ll look at is word frequency, or how often a string (in this case “love”) occurs in the data frame. To do this, we must identify every time the word “love” appears and highlight it in a way so that it can be counted based on different properties of its environment (e.g., by play, by player, by scene, etc).
Here are the first 10 rows of a data frame that contains the number of times “love” appears in each play. It’s been sorted in descending order, but doesn’t contain any other information about where and when the word occurs.
# play level word frequency
plays <- unique(shak$Play)
loveFreq <- numeric()
for (i in seq_along(plays)) {
  text <- Corpus(VectorSource(paste(shak[shak$Play == plays[i], ]$PlayerLine, collapse = " ")))
  text <- tm_map(text, removePunctuation)
  text <- tm_map(text, PlainTextDocument)
  text <- tm_map(text, removeWords, stopwords('english'))
  # stemming merges "loved", "loving", etc. into the stem "love"
  text <- tm_map(text, stemDocument)
  tdm <- TermDocumentMatrix(text)
  loveFreq[i] <- as.numeric(slam::row_sums(tdm)["love"])
}
lPlay <- data.frame(plays,loveFreq)
lPlay <- na.omit(lPlay)
# order the plays based on the occurrence of "love"
lPlay<-lPlay[order(-lPlay$loveFreq),]
head(lPlay,10)
plays loveFreq
35 Two Gentlemen of Verona 188
28 Romeo and Juliet 160
6 As you like it 138
22 A Midsummer nights dream 128
17 Loves Labours Lost 125
23 Much Ado about nothing 122
24 Othello 108
33 Troilus and Cressida 87
27 Richard III 86
11 Hamlet 85
We can also look at which players say “love” the most over the course of their appearances. These are the top 10 players who use the word “love” most.
# player level word frequency
players <- unique(shak$Player)
loveFreq <- numeric()
for (i in seq_along(players)) {
  text <- Corpus(VectorSource(paste(shak[shak$Player == players[i], ]$PlayerLine, collapse = " ")))
  text <- tm_map(text, removePunctuation)
  text <- tm_map(text, PlainTextDocument)
  text <- tm_map(text, removeWords, stopwords('english'))
  text <- tm_map(text, stemDocument)
  tdm <- TermDocumentMatrix(text)
  loveFreq[i] <- as.numeric(slam::row_sums(tdm)["love"])
}
lPlayer <- data.frame(players,loveFreq)
lPlayer <- na.omit(lPlayer)
#order
lPlayer <- lPlayer[order(-lPlayer$loveFreq),]
head(lPlayer,10)
players loveFreq
904 PROTEUS 59
190 ROSALIND 57
771 ROMEO 56
169 HELENA 46
906 JULIA 41
650 IAGO 40
572 JULIET 37
38 GLOUCESTER 36
517 BIRON 36
639 BENEDICK 36
We can also look at the bottom of the list. These are 10 players who only say “love” once, although there are likely many others who are also tied for last.
tail(lPlayer,10)
players loveFreq
890 CALCHAS 1
899 SIR TOBY BELCH 1
903 FABIAN 1
914 Third Outlaw 1
917 ARCHIDAMUS 1
922 MAMILLIUS 1
925 PAULINA 1
932 PERDITA 1
933 DORCAS 1
934 MOPSA 1
Is this useful to you? Can word frequency by character/player, scene, act, play, or author help to answer any of your research questions?
I think the main way quantitative analysis can be of use to the humanities is by visualising properties of the text that might not be immediately apparent from reading. Word frequency is one of these properties, since we (as humans) don’t typically keep track of how often each character says any given word. If you’re interested in how different characters or different authors make use of certain words or phrases, visualising the distribution of those strings might uncover patterns that are otherwise difficult to find.
For instance, maybe you are curious how the longer and shorter plays compare. Instead of hand-counting each, we can graph and order them. Based on this graph, you don’t need to know exactly how long each is, but you can see that Othello is much longer than Loves Labours Lost, which can inform how you approach the comparison.
shak %>%
group_by(Play) %>%
summarise(n = n()) %>%
ggplot(., aes(x=reorder(Play, n),y=n)) +
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Length of Shakespeare's plays") +
theme(legend.position="none") +
xlab("Play") +
ylab("Number of lines")
Within a single play, maybe we want to know which characters are the chattiest. We can visualise the number of lines of text per character to get a sense of who is dominating the stage.
Obviously, it’s Hamlet.
shak %>%
filter(Play == "Hamlet") %>%
group_by(Player) %>%
summarise(n = n()) %>%
ggplot(., aes(x=reorder(Player, n),y=n)) +
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Speech in Hamlet") +
theme(legend.position="none") +
xlab("Player") +
ylab("Number of lines")
One property of much real-life, natural language data (and many other phenomena in human behaviour) is that the frequencies of different events or items tend to follow a Zipf distribution. This just means that there are a very small number of incredibly frequent things, and a very large number of very infrequent things. Plotted on a linear scale, this distribution looks like a very steep curve, but plotted logarithmically, it looks more like a straight line.
Since it appears that the number of lines per player in Hamlet follows a Zipf curve, we can easily change the count axis (the y-axis in the code, which runs along the bottom of the chart after coord_flip()) to a logarithmic scale. This means that each unit of distance from the lower left is 10 times the value of the previous unit: the distance from 1 to 10 appears the same as the distance from 10 to 100, and again from 100 to 1000. This kind of scale de-emphasizes the absolute differences in frequency among the most frequent things and helps resolve nuanced differences among the least frequent things.
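The log-scale intuition can be checked on synthetic data: if frequency is proportional to 1/rank (an idealised Zipf curve), then equal multiplicative steps in rank become equal distances on a log10 axis. A minimal base-R sketch with toy numbers, not the Hamlet data:

```r
# Idealised Zipf frequencies: frequency proportional to 1/rank
rank <- 1:1000
freq <- 1000 / rank

# On a log10 scale, the gap from rank 1 to rank 10 equals the gap
# from rank 10 to rank 100 -- both are exactly one unit of distance.
log10(freq[1])  - log10(freq[10])   # 1
log10(freq[10]) - log10(freq[100])  # 1
```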
When we make this change to the visualisation above, suddenly we see a lot of nuance in the “long tail” of the data. The players with the fewest lines no longer all appear to have the same number, and this might be useful information about who speaks when.
shak %>%
filter(Play == "Hamlet") %>%
group_by(Player) %>%
summarise(n = n()) %>%
ggplot(., aes(x=reorder(Player, n),y=n)) +
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Speech in Hamlet") +
theme(legend.position="none") +
xlab("Player") +
ylab("Number of lines (logarithmic scale)") +
scale_y_log10()
We can also compare frequency across plays. By seeing which plays use the word “love” most often, we might be able to group them (perceptually) into plays about love and plays that are not. Maybe?
lPlay %>%
ggplot(., aes(x=reorder(plays, loveFreq),y=loveFreq)) +
geom_bar(aes(),stat="identity") +
coord_flip() +
ggtitle("Love in each play") +
# theme(legend.position="none") +
xlab("Play") +
ylab("frequency of the word 'love'") +
theme(legend.position = "none")
One thing that graphs can do very easily is give you a way to identify trends when you sort events (e.g., plays) into multiple different categories. For instance, the frequency graph above is interesting, but there are so many plays and as a non-expert, I can’t tell you what each is about, what style it is written in, or whether I’d expect it to be about “love” or not. So, we can add another dimension of information. In the following graph, each color represents a different category (as determined by Wikipedia’s First Folio page, plus information about the “late romances”). Now, we can see if there are trends for different categories to mention “love” more or less than the others.
lPlayCat <- lPlay
lPlayCat$category <- NA
comedies <- c("A Comedy of Errors", "As you like it", "Alls well that ends well",
              "Loves Labours Lost", "Measure for measure", "Merchant of Venice",
              "Merry Wives of Windsor", "A Midsummer nights dream",
              "Much Ado about nothing", "Taming of the Shrew",
              "Twelfth Night", "Two Gentlemen of Verona")
romances  <- c("Pericles", "Cymbeline", "A Winters Tale", "The Tempest")
histories <- c("King John", "Richard II", "Richard III", "Henry IV", "Henry V",
               "Henry VI Part 1", "Henry VI Part 2", "Henry VI Part 3",
               "Henry VIII", "Coriolanus", "Julius Caesar",
               "Antony and Cleopatra", "King Lear", "macbeth")
tragedies <- c("Titus Andronicus", "Romeo and Juliet", "Hamlet",
               "Troilus and Cressida", "Othello", "Timon of Athens")
lPlayCat$category[lPlayCat$plays %in% comedies]  <- "comedy"
lPlayCat$category[lPlayCat$plays %in% romances]  <- "romance"
lPlayCat$category[lPlayCat$plays %in% histories] <- "history"
lPlayCat$category[lPlayCat$plays %in% tragedies] <- "tragedy"
# sort(unique(lPlay$plays))
lPlayCat %>%
ggplot(., aes(x=reorder(plays, loveFreq),y=loveFreq)) +
geom_bar(aes(fill=category),stat="identity") +
coord_flip() +
ggtitle("Love in each play") +
# theme(legend.position="none") +
xlab("Play") +
ylab("frequency of the word 'love'")
It seems to me that comedies and tragedies discuss “love” the most, whereas histories and the late romances discuss it the least. Is this intuitive? Maybe. But there’s a problem. A Comedy of Errors has the fewest mentions of “love”, but it’s also the shortest play, so it has the fewest words overall. What we really want to see is the frequency of “love” as a proportion of each play’s length, not the raw counts. To do that, we have to add the total length of each play to the data frame.
playLength <- shak %>%
group_by(Play) %>%
summarise(n = n())
lPlayCat$length <- NA
for (i in seq_along(playLength$n)) {
  lPlayCat$length[lPlayCat$plays == playLength$Play[i]] <- playLength$n[i]
}
lPlayCat %>%
mutate(proportion = loveFreq/length) %>%
ggplot(., aes(x=reorder(plays, proportion),y=proportion)) +
geom_bar(aes(fill=category),stat="identity") +
coord_flip() +
ggtitle("Love in each play") +
# theme(legend.position="none") +
xlab("Play") +
ylab("proportional frequency of the word 'love'")
Not a whole lot has changed, but I think the distribution of comedies and tragedies is even more pronounced. And, we have more information about A Comedy of Errors, which is still very close to the bottom of the graph. Not every comedy is about love, it seems.
Finally, we can generate these same types of graphs for different subgroups, too. Here’s one example, where we look at the number of lines each player has, focusing only on players with more than 700 lines. We can also see if there are any trends in these top speakers by play category. It seems to me that the histories dominate as a group, but Hamlet and Iago steal the scene (so to speak).
shak %>%
group_by(Play,Player,category) %>%
summarise(n = n()) %>%
filter(n > 700) %>%
ggplot(., aes(x=reorder(Player, n),y=n)) +
geom_bar(aes(fill=category),stat="identity") +
coord_flip() +
ggtitle("Number of lines by character") +
# theme(legend.position="none") +
xlab("Player") +
ylab("Number of lines")
Is this because histories and tragedies tend to be longer plays, overall? Quite possibly.
Single words might not be able to tell us much about the texts, which is why examining collocations is such a popular technique. What’s the difference between “the king”, “kill the king”, and “kiss the king”? A lot, but we won’t know if we only look for instances of “king”. This is where n-grams become useful. N-grams are sequences of n adjacent words: sets of two words are bigrams, and sets of three words are trigrams.
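As a minimal illustration of the idea (a toy phrase in base R, rather than the tm/tidytext pipeline used elsewhere in this notebook):

```r
# Split a line into lowercase words, then pair each word with its neighbour(s)
line  <- "kill the king"
words <- strsplit(tolower(line), " ")[[1]]

# Bigrams: each word paired with the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams   # "kill the" "the king"

# Trigrams: a sliding window of three adjacent words
n <- length(words)
trigrams <- if (n >= 3) paste(words[1:(n - 2)], words[2:(n - 1)], words[3:n]) else character(0)
trigrams  # "kill the king"
```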
First, we can look at the corpus as a list of words, rather than a list of lines (by act and scene).
shak %>%
as_tibble(.) %>%
unnest_tokens(tbl=., input = PlayerLine, output = word) %>%
as.data.frame() %>% head(10)
Dataline Play PlayerLinenumber ActSceneLine Player category word
1 1 Henry IV NA history act
2 1 Henry IV NA history i
3 2 Henry IV NA history scene
4 2 Henry IV NA history i
5 2 Henry IV NA history london
6 2 Henry IV NA history the
7 2 Henry IV NA history palace
8 3 Henry IV NA history enter
9 3 Henry IV NA history king
10 3 Henry IV NA history henry
Now, we can automate the process of counting how often each word occurs. But, of course, certain words are going to be extremely common (see Zipf’s Law), and those words are unlikely to be informative.
shak %>%
as_tibble(.) %>%
unnest_tokens(tbl=., input = PlayerLine, output = word) %>%
count(word, sort = TRUE) %>%
as.data.frame() %>% head(10)
word n
1 the 27052
2 and 25082
3 i 20142
4 to 18984
5 of 15862
6 a 14196
7 you 13347
8 my 11875
9 in 10540
10 that 10441
Once we’ve filtered out our stop-words, the most frequent words look quite different.
shak %>%
as_tibble(.) %>%
unnest_tokens(tbl=., input = PlayerLine, output = word) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
as.data.frame() %>% head(10)
word n
1 thou 5193
2 thy 3727
3 thee 3024
4 lord 2621
5 sir 2454
6 enter 2338
7 love 1927
8 hath 1845
9 king 1500
10 tis 1384
However, Shakespeare uses a lot of words that aren’t in our default stop-word list, so we can append our own custom list.
word <- c(NA, "thou", "thee", "thy", "thine", "dost", "shalt", "wilt", "hast",
          "hath", "scene", "tis", "ii", "iii", "iv", "v", "vi", "vii")
lexicon <- rep("shakespeare", length(word))
new_stop <- data.frame(word, lexicon)
shak_stop <- rbind(new_stop, stop_words)
Here is a visualisation of how stop words affect the corpus.
Before we try to compare across plays, let’s see what the most common bigrams are overall (after being filtered by the custom stop words list).
shak %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
as.data.frame() %>% head(10)
word1 word2 n
1 enter king 101
2 mine eyes 95
3 king henry 88
4 sir john 80
5 mark antony 76
6 mine honour 71
7 king richard 51
8 god save 48
9 gracious lord 46
10 noble lord 46
If we choose a subset of words, we can look at how they are distributed across different plays. In this case, we can compare death, king, love, and sweet across six plays. Unsurprisingly, Romeo and Juliet uses the word love more than any other play, although Midsummer Night’s Dream is close. King is also much more common in plays about kings (surprise, surprise).
shak[,c(2,5,6)] %>%
as_tibble() %>%
unnest_tokens(tbl=., input = PlayerLine, output = word) %>%
filter(word=="love" | word =="king" | word=="death" | word=="sweet") %>%
#add_count(Player) %>%
group_by(Player,Play,word) %>%
summarise(n=n()) %>%
#anti_join(stop_words) %>%
filter( Play == "Hamlet" |
Play == "King Lear" |
Play == "A Midsummer nights dream" |
Play == "Othello" |
Play == "Henry V" |
Play == "Romeo and Juliet") %>%
arrange(desc(n)) %>%
ggplot(., aes(x=word,y=n)) +
geom_bar(aes(fill=word),stat="identity") +
# coord_flip() +
facet_wrap(~Play)
But there are more interesting words you could compare, certainly. This is just one example.
It’s probably much more interesting to look at collocation than simple word frequency. After all, it gives more context. However, it also reduces the number of tokens substantially.
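A toy sketch of why the token count shrinks (invented word and stop lists, not the real corpus): a line of n words yields only n − 1 bigrams, and a bigram survives stop-word filtering only if *both* of its words survive.

```r
# Toy word sequence and toy stop list
words <- c("so", "shaken", "as", "we", "are", "so", "wan", "with", "care")
stop  <- c("so", "as", "we", "are", "with")

# Unigrams surviving the stop list
length(words[!words %in% stop])  # 3 ("shaken", "wan", "care")

# Bigrams: adjacent pairs; keep only pairs where BOTH words survive
bigrams <- paste(head(words, -1), tail(words, -1))
keep <- !(head(words, -1) %in% stop) & !(tail(words, -1) %in% stop)
sum(keep)  # 0 -- every surviving word is adjacent to a stop word
```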
Here, we can visualise three pairs of gendered noun phrases:
| Masculine | Feminine |
|---|---|
| my lord | my lady |
| my father | my mother |
| my husband | my wife |
What kinds of information can we see in the graph?
I think the most exciting way to use n-grams is probably network graphs. These graphs show what words co-occur, and in what order. Moreover, they can encode a number of dimensions visually, which would be very difficult to calculate by hand or plot in a more standard quantitative method.
library(igraph)  # graph_from_data_frame()
library(ggraph)  # network-graph plotting
set.seed(814)
a <- grid::arrow(type = "closed", angle = 22.5, length = unit(.1, "inches"))
p1 <- shak %>%
filter(Play=="Hamlet") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 6) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="darkblue", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void()
set.seed(814)
p2 <- shak %>%
filter(Play == "Twelfth Night") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 6) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="darkred", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "salmon", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void()
set.seed(814)
p3 <- shak %>%
filter(Play == "Romeo and Juliet") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 6) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="darkgreen", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "green2", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void()
# multiplot() is not in base R; it comes from the Rmisc package
# (or the Cookbook for R helper function of the same name)
multiplot(p1, p2, p3, cols = 3)
set.seed(814)
a <- grid::arrow(type = "closed", angle=22.5, length = unit(.1, "inches"))
p1 <- shak %>%
filter(ActSceneLine != "") %>%
filter(Play=="Hamlet") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="darkblue", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void() +
ggtitle("Hamlet")
set.seed(814)
p2 <- shak %>%
filter(ActSceneLine != "") %>%
filter(Play == "Twelfth Night") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="darkred", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "salmon", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void() +
ggtitle("Twelfth Night")
set.seed(814)
p3 <- shak %>%
filter(ActSceneLine != "") %>%
filter(Play == "Romeo and Juliet") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="darkgreen", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "green2", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void() +
ggtitle("Romeo and Juliet")
set.seed(814)
p4 <- shak %>%
filter(ActSceneLine != "") %>%
filter(Play == "Othello") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="darkorange", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "orange", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void() +
ggtitle("Othello")
set.seed(814)
p5 <- shak %>%
filter(ActSceneLine != "") %>%
filter(Play == "Henry IV") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="cadetblue4", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "cyan", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void() +
ggtitle("Henry IV")
set.seed(814)
p6 <- shak %>%
filter(ActSceneLine != "") %>%
filter(Play == "The Tempest") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="violet", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "magenta", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void() +
ggtitle("The Tempest")
multiplot(p1,p2,p3,p4,p5,p6,cols=3)
Lowering the frequency threshold should give us a better idea of slightly looser connections.
set.seed(814)
a <- grid::arrow(type = "closed", angle=22.5, length = unit(.1, "inches"))
shak %>%
as_tibble() %>%
filter(ActSceneLine != "") %>%
unnest_tokens(input = PlayerLine, output = trigram, token = "ngrams", n = 3) %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>% # separates the trigram into three columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
filter(!word3 %in% shak_stop$word) %>% # filters stop words from third column
count(word1, word2, word3, sort = TRUE) %>%
filter(n > 2) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="darkblue", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void()
We can also split the ActSceneLine column into separate act, scene, and line numbers, and count how many lines of dialogue each scene contains.
shak %>%
as_tibble() %>%
#filter(Play == "Hamlet" | Play == "Loves Labours Lost" | Play == "A Winters Tale") %>%
filter(ActSceneLine != "") %>%
mutate(ActSceneLine2 = ActSceneLine) %>%
separate(ActSceneLine2, c("act", "scene", "line")) %>%
count(Play,act,scene, sort=TRUE) %>%
transmute(play=Play, act=as.numeric(act), scene=as.numeric(scene), n=n)
# A tibble: 737 x 4
   play                  act scene     n
   <chr>               <dbl> <dbl> <int>
 1 Loves Labours Lost      5     2   972
 2 A Winters Tale          4     4   929
 3 Hamlet                  2     2   616
 4 King John               2     1   609
 5 The Tempest             1     2   596
 6 Cymbeline               5     5   584
 7 Measure for measure     5     1   580
 8 Timon of Athens         4     3   577
 9 Richard III             4     4   561
10 A Winters Tale          1     2   539
# ... with 727 more rows
What this all seems to tell us is that we can visualise the structure of the plays, separate from their content. Is this useful to you?